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Abstract: 

A new method for sequence optimization in protein models is presented. The approach, which 
has inherited its basic philosophy from recent work by Deutsch and Kurosky [Phys. Rev. Lett. 
76, 323 (1996)] by maximizing conditional probabilities rather than minimizing energy functions, 
is based upon a novel and very efficient multisequence Monte Carlo scheme. By construction, the 
method ensures that the designed sequences represent good folders thermodynamically. A bootstrap 
procedure for the sequence space search is devised making very large chains feasible. The algorithm 
is successfully explored on the two-dimensional HP model with chain lengths TV = 16, 18 and 32. 
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The "inverse" of protein folding, sequence optimization, is of utmost relevance in the context of 
drug design. This problem, which amounts to finding optimal amino acid sequences given a target 
structure, has also been investigated in the context of understanding folding properties of coarse- 
grained models for protein folding. Such models are described by energy functions E{r,cr), where 
r = {ri, r2, .., r^r} denotes the amino acid coordinates and cr — {cri, (T2, .., crjv} the amino acid 
sequence. 

Good folding sequences fold fast and in a stable way into the desired target structure. A brute 
force search for sequences meeting these criteria is prohibitively time-consuming even in minimalist 
models for protein folding. Although it has been possible to apply this type of criteria to a simple 
helix-coil model it is essential to find more efficient strategies. A fairly drastic simplification was 
proposed in , where the problem is approached by minimizing E with respect to a with r clamped 
to the target structure, tq. This method is very fast since no exploration of the conformational space 
is involved, but, unfortunately, it fails for a number of examples (see e.g. (l], ^, Q). Recently, a more 
generic scheme was suggested which aims at optimizing the conditional probability P(ro|cr), i.e. 
the Boltzmann weight, rather than E{ro, a). This approach has the advantage that entropy effects 
are taken into account, but its usefulness is not obvious since maximizing P{rQ\a) is a non-trivial 
task. In fact, the calculations in |^ involved simplifying assumptions about both the form of P(ro|cr) 
and the conformational space. In this letter we present a practical Monte Carlo (MC) procedure 
for performing the maximization of P(ro|cr). 

Thermodynamical characteristics for good folders are that the ground state minima are well sepa- 
rated from other states — at finite T the system spends a long time in the ground state well. In 
lattice models, where for relatively small chains the states are enumerable, this is often taken as 
non-degeneracy of ground state and that the latter is well separated from higher energy states. One 
expects that working with finite T distributions in the matching process singles out those optimal 
sequences that have good folding properties in terms of non-degeneracy. Indeed, in when explor- 
ing the technique on lattice models, superior results are obtained when comparing with what was 
obtained in where E{rQ,a) was minimized. 

Computationally, straightforward MC approaches for maximizing P(ro|cr) are extremely tedious. 
Our novel MC methodology is based on the multisequence method Q, where both sequence and 
coordinate degrees of freedom are subject to simultaneous moves. The basic idea is to perform 
a single simulation of a joint probability distribution P(r, cr) rather than repeated simulations of 
P{r\(T) for different fixed a. Hence, our approach is fundamentally different from that of [|. 

Our method for maximizing P(r|cr) is explored on the two-dimensional HP lattice model j6j with 
chain lengths N = 16, 18 and 32. For A^=16 we study an example used in ^ ^. The results 
for both iV=16 and 18 are checked against exact enumerations, whereas for = 32 we use a 
target structure constructed "by hand" . Our method reproduces the exact results extremely rapidly 
whenever comparisons are feasible. Furthermore, the method has quite some potential to deal with 
very large chains. 

The problem of finding thermodynamically optimal sequences given a target structure Tq is simple 
to formulate mathematically — maximize with respect to a the conditional probability 
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Z{a)=Y,eM-E{r,cj)/T) 
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where Eq. can be rewritten in terms of the free energy F{a) = —TlnZ{a) as 

P(ro|a) = exp[-iEiro,<j)-F{a))/T] 



(3) 



Hence, for each a one needs to estimate P(ro|cr) which in turn involves a sum over aU possible r. 
The situation is shown in Fig. |l| where the horizontal line represents the region probed in protein 
folding. In the simplified approach to the inverse problem j2|, minimizing E{rQ, cr), one works along 
the vertical line. Maximizing P(ro|cr) is a real challenge since it requires sampling of the entire 
(r, (T)-plane. Refs. M, 0] approached this problem by using simulated annealing in sequence space. 



Figure 1: The (r, o')-plane (see text). 

The key difficulty then is to estimate the partition function Z{a) (Eq. (||)). In ||^ this was done 
using the lowest-order term in the cumulant expansion of F(a). This approximation is valid at high 
temperature but it is unclear how good it is in the temperature regime of interest here. Ref. Q, on 
the other hand, used a chain growth MC method to estimate Z{a). In this case one has a nested 
MC, where the inner part by itself is far from trivial. 



In y these methods were successfully tested on examples where a simple minimization of £^(^0, cr) 
along the vertical line in Fig. |] fails. The chains were short enough for the results to be tested against 
exact enumerations. Another difference between optimizing P(ro\a) versus E(ro, cr) is that the latter 
requires an optimization constrained to a preset net hydrophobicity. 



In this letter we take a quite different path capitalizing on the multisequence method g. Here the 
basic strategy is to create an enlarged configuration space; the sequence a becomes a dynamical 
variable. Hence, r and a are put on a more equal footing, which, in particular, enables us to avoid 
a nested MC. 



Our starting point is the joint probability distribution 
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where {5(0")} is a set of tunable parameters, which govern the marginal distribution 

P(a) = J2 Pir, ^)^^ cxp(~5(c7))Z(a) (6) 

r 

From the Bayes relation P{r,a) ~ P{r\a)P{(j) one obtains the desired conditional probabilities, 
Eq. (|l|), which of course are independent of the choice of g{(T). 

The choice of g{a) is crucial for the efficiency of the method. At first sight, it may seem that one 
would need to estimate Z{a) in order to obtain reasonable .g(cr). However, a convenient choice is 

g{a)^-E{ro,a)/T (7) 

Maximizing P(ro|cr) in Eq. corresponds to minimizing the quantity 

AFa = -TlnPira\a)=E{ro,c7)-F{a) (8) 

which for the choice in Eq. (|^) can be rewritten as 

AFo = rinP(cr) +TlnZ (9) 

Hence, neglecting an unimportant constant, AFq can be obtained directly from the marginal distri- 
bution P{a). 

The joint distribution P(r, a) can be simulated by using separate ordinary r and a updates. This 
single simulation of P{r, a) replaces simulations of P{r\a) for a number of different fixed a. This is 
quite convenient. However, it should be stressed that the major motivation for using this scheme 
is its efficiency. In fact, it was demonstrated in that the simulation of P{r, a) can be much 
faster than the simulation of P{r\(T) even for a single cr; the exploration of the conformational space 
becomes more efficient when the sequence is allowed to fluctuate. 



The number of sequences that can be studied in a multisequence simulation is of course limited. It 
is therefore desirable to incorporate a step in which "bad" sequences are removed. This elimination 
step can be formulated in different ways. In our calculations a sequence a is removed as soon 
as some structure r ^ tq is encountered for which E(r,a) < E(ro,a). This process is continued 
until the remaining sequences can be studied through a final multisequence run. If the elimination 
proceeds for a sufficiently long time, then the surviving sequences are, by construction, those that 
have the desired structure as their non-degenerate ground state. It should be pointed out that the 
elimination process serves two purposes. In addition to bringing down the number of sequences to a 
manageable level, it also tends to make the distribution P{a) more uniform, which is instrumental 
for the efficiency of the final multisequence run. Note that for a set of sequences all having vq as 
their unique ground state, one has ex-p{—g{(T))Z{a) w 1 to leading order at low T, independent of 
cr, which implies that P{a) becomes uniform in the zero temperature limit. 

We have performed extensive numerical explorations on the HP model |^ for a variety of sizes and 
target structures and find that the approach consistently identifies the appropriate sequences in an 
efficient way. In this letter we report on results for iV = 16 and 18, which have been previously used 
for evaluating design algorithms [^, Q . Also results for TV = 32 are presented in some detail. 

The HP model § is defined by 

Eir,a) = -Y,a^<^JMn-rJ) (10) 

i<j 
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Figure 2: Target structures for (a) = 18 and (b) = 32. Symbols are explained in the text. The 
best sequence found for (b) is: HHPPHHPPPPHPHPPPPHPHPPPPHHPPHHHH. 

where A{ri — Vj) = 1 if and rj are nearest neighbor monomers but non- adjacent along the chain 
and zero otherwise. Dependent upon whether ai is hydrophobic (H) or polar (P), one has (7^=1 
and 0, respectively. We work on the square lattice, for which it is known that sequences with 
non-degenerate ground states are not too rare 0. All simulations are performed using standard 
Metropolis steps |^ in a. In r we use three types of elementary moves: one-bead, two-bead and 
pivot 1^, . A sweep refers to a combination of these three move types followed by one attempt to 
update a. 

We first test our method for the = 16 target structure studied in Q (see Fig. 1 in There is 
one sequence which has this structure as its non-degenerate ground state, as can be shown by exact 
enumeration. It turns out that the design procedure in [Q is able to find this sequence, while the 
methods in || fail to do so. Our calculation is carried out starting from the set of all 2^^ possible 
sequences. After 8000 MC sweeps, corresponding to a few CPU seconds on a DEC Alpha 200, all 
sequences except the correct one have been removed. 

Next, wc turn to the = 18 target structure shown in Fig. ^a. The seven sequences listed in Table |^ 
were found by the design procedure. Non-degeneracy is tested and confirmed as in the A^ = 16 case 
above. The multisequence simulation is then continued with only the seven sequences left in order 
to estimate P{cr) and P{ro\a). As can be seen from Table the results agree very well with the 
exact results. Since all seven sequences have vq as their unique ground state, -P(cr) is constant for 
T = 0. At T = 1/3, the temperature studied here, the P(cr)'s are not perfectly equal, but similar 
enough to allow for good mobility in sequence space. In summary, our method has removed those 
sequences that do not have rp as their unique ground state, and it also provides P{rQ\a) for all the 
surviving sequences. Note that the remaining sequences all have the same monomer type at 12 of 
the 18 positions. These are marked by filled (H) and open (P) circles in Fig. ||a. In the second 
part of our simulations, where P{rQ\a) is estimated, it is clear that the stochastic sequence moves 
are essential. How useful these moves are in the first part, the elimination process, is less clear. 
To investigate this, we performed calculations both with and without these moves. In the latter 
case the simulated sequence is replaced only if it is to be removed from the simulation, and is then 
replaced by a randomly chosen sequence among the remaining ones. In Fig. ^ we show the number 
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Sequence MC exact 

PHPPPHPPHPPHHPPPHP 0.2101(08) 0.2112 

PHPPHHPPHPPHHPPPHP 0.0625(09) 0.0617 

PHPPPHPPHPHHHPPPHP 0.3104(18) 0.3113 

PHPPPHPPHPPHHHPPHP 0.0495(03) 0.0495 

PHPPPHPHHPPHHHPPHP 0.1765(19) 0.1757 

PHPPPHPPHPPHHHPHHP 0.1102(07) 0.1110 

PHPPPHPPHPHHHHPPHH 0.0807(19) 0.0797 

Table 1: P(cr) for those seven iV = 18 sequences that design the structure shown in Fig. ||a (T = 1/3). 
Listed are both the results from our multisequence simulation (MC) and the exact results, obtained 
by enumeration. 
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Figure 3: The number of MC sweeps needed to single out the seven sequences in Table 0. Each data 
point is an average over 50 experiments, each started from the full set of all 2^^ sequences. Shown 
are both results obtained with (x) and without (o) stochastic sequence moves. 



of MC sweeps needed to remove all sequences except those in Table as a function of l/T. The 
results show that the required number of sweeps can be reduced by more than a factor of 10 by 
adding the stochastic sequence moves. Furthermore, the efficiency is less T dependent. The cost of 
the sequence moves is negligible. 

We next turn to the iV = 32 target structure shown in Fig. ^b, which is designed by hand since 
exhaustive enumeration is impracticable for this problem size. It is readily verified that this structure 
represents the minimum energy for the sequence with H at the filled circles and P at all the other 
positions along the chain (see Fig. ||b). As with any other method, it is for large chains not feasible 
to explore the entire sequence space with our multisequence method. However, as in the iV = 18 
example above, a given structure typically exhibits several positions where Ui is effectively frozen 
to H or P (see e.g. |l^, 0). It turns out that such positions can be easily detected by means of a 
trial run. This leads us to a two-step bootstrap procedure. 
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Figure 4: Average of against i for the surviving sequences from 10 runs, each started with a set of 
10^ random sequences {N = 32). The upper and lower hues represent cr^^-' and a^'^\ respectively (see 
text). The length of each run is 20000 MC sweeps (45 CPU seconds) and the number of surviving 
sequences varies between 15 and 57. 



The first step amounts to picking sets of random sequences and gauging ai for the surviving se- 
quences. The ai profile obtained this way for the target structure in Fig. |p is shown in Fig. |4[ from 
which it is clear that indeed many at exhibit a clear preference for either P or H. Based on this, 
we divide the positions along the chain into three groups corresponding to ai > ct'"'^' (filled circles 
in Fig. 1^), ai < cr^^^ (open circles) and cr'-^-' < ai < a^^^ (crosses), as indicated in Fig. ^. We then 
rerun the algorithm with those ai in the first two groups clamped to H {ai — 1) and P {ai = 0), 
respectively, and those in the third group left open, which corresponds to a set of 2^^ sequences. In 
20 CPU minutes (5 • lO*' MC sweeps) this set was reduced to 200 sequences. These can be readily 
studied in a final multisequence run, and the best sequence found is given in Fig. |. Stabihty was 
confirmed by repeating the procedure for different random seeds. We also performed runs where the 
elimination process was continued for much longer, which finally contained 167 surviving sequences. 
While it could be that this is still not the proper asymptotic value, we feel confident that the best 
sequence is correctly identified by our method. 

It is clear that this method can be generalized to a corresponding multi-step procedure for very 
large chains. 

In summary, we have developed a new efficient MC method for protein design by maximizing 
conditional probabilities using the multisequence method. The method circumvents calculations of 
partition functions by a judicious choice of the multisequence sample weights. Large chains are 
feasible with the approach by means of a bootstrap procedure that limits the search in sequence 
space. The method, which is successfully explored on two-dimensional lattice models, can easily 
be used in off-lattice models [ p^ . An alternative to prune cr-space by removing E{r,a) < E{rQ,a) 
sequences is to discard high P{a) sequences (see Eq. (^)). 
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